PDD Crawler: A focused web crawler using link and content analysis for relevance prediction
Authors
Abstract
The majority of computer and mobile phone users turn to the web for their search activity. Web search engines carry out these searches; the results a search engine returns are supplied to it by a software module known as the web crawler. The size of the web is growing round the clock, and the principal problem is searching this huge database for specific information. Deciding whether a web page is relevant to a search topic is a dilemma. This paper proposes a crawler, called the "PDD Crawler", which follows both a link-based and a content-based approach. The crawler follows a completely new crawling strategy to compute the relevance of a page: it analyses the content of the page based on the information contained in various tags within the HTML source code and then computes the total weight of the page. The page with the highest weight thus has the most content and the highest relevance.
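As a rough illustration of the tag-weighting idea described above, the sketch below assigns hypothetical per-tag weights and sums topic-keyword hits inside each tag to obtain a page weight. The actual tags, weights, and scoring rules of PDD Crawler are not given in this abstract, so all values here are assumptions.

from html.parser import HTMLParser

# Hypothetical per-tag weights; the abstract does not state which tags or
# values PDD Crawler actually uses, so these numbers are placeholders.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "meta": 3.0, "a": 2.0, "p": 1.0}
VOID_TAGS = {"meta", "link", "br", "img", "input", "hr"}

class TagWeightScorer(HTMLParser):
    """Accumulates a page weight from topic keywords found inside weighted tags."""

    def __init__(self, keywords):
        super().__init__()
        self.keywords = {k.lower() for k in keywords}
        self.tag_stack = []
        self.weight = 0.0

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            # <meta> text lives in its attributes rather than in element text.
            self._score(dict(attrs).get("content") or "", TAG_WEIGHTS["meta"])
        if tag not in VOID_TAGS:
            self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        if self.tag_stack and self.tag_stack[-1] == tag:
            self.tag_stack.pop()

    def handle_data(self, data):
        tag = self.tag_stack[-1] if self.tag_stack else None
        self._score(data, TAG_WEIGHTS.get(tag, 0.5))

    def _score(self, text, tag_weight):
        hits = sum(1 for w in text.lower().split() if w.strip(".,;:!?") in self.keywords)
        self.weight += tag_weight * hits


def page_weight(html, keywords):
    scorer = TagWeightScorer(keywords)
    scorer.feed(html)
    return scorer.weight


if __name__ == "__main__":
    html = ("<html><head><title>Focused web crawler</title></head>"
            "<body><h1>Crawler design</h1>"
            "<p>A focused crawler downloads only relevant pages.</p></body></html>")
    print(page_weight(html, ["crawler", "focused", "relevant"]))

Scoring keyword hits per tag keeps the computation a single pass over the HTML; a focused crawler can then expand the links of the highest-weight pages first.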
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler, it is not a simple task to download domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them, a key technique is focused crawling, which is able to crawl particular topical...
Focused crawling for both relevance and quality of medical information
Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful. To address problems of cost, coverage and quality, we ...
Improving the performance of focused web crawlers
This work addresses issues related to the design and implementation of focused crawlers. Several variants of state-of-the-art crawlers relying on web page content and link information for estimating the relevance of web pages to a given topic are proposed. Particular emphasis is given to crawlers capable of learning not only the content of relevant pages (as classic crawlers do) but also paths ...
Design of Improved Web Crawler By Analysing Irrelevant Result
A key issue in designing a focused Web crawler is how to determine whether an unvisited URL is relevant to the search topic. Effective relevance prediction can help avoid downloading and visiting many irrelevant pages. In this module, we propose a new learning-based approach to improve relevance prediction in focused Web crawlers. For this study, we chose Naïve Bayesian as the base prediction m...
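As a minimal sketch of the kind of learning-based relevance prediction the last snippet describes (not the exact model from that paper), the code below trains a Naïve Bayes classifier on the link-context text of pages that have already been judged, and scores unvisited URLs before they are downloaded. The features, training examples, and labels here are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: anchor/URL text of links whose target pages were
# already fetched and judged relevant (1) or irrelevant (0) to the search topic.
link_texts = [
    "focused web crawler relevance prediction",
    "topical crawling frontier ordering",
    "celebrity gossip photo gallery",
    "online shoe store discount",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB()
model.fit(vectorizer.fit_transform(link_texts), labels)

# Score unvisited URLs from their link context alone, before downloading them;
# low-probability links can be skipped or pushed to the back of the queue.
candidates = ["survey of web crawler link analysis", "cheap shoes summer sale"]
probs = model.predict_proba(vectorizer.transform(candidates))[:, 1]
for text, prob in zip(candidates, probs):
    print(f"{prob:.2f}  {text}")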
Journal: CoRR
Volume: abs/1411.4366
Publication year: 2014